Explaining away ambiguity: Learning verb selectional preference with Bayesian networks*

Abstract

This paper presents a Bayesian model for unsu- 
pervised learning of verb selectional preferences. 
For each verb the model creates a Bayesian 
network whose architecture is determined by 
the lexical hierarchy of Wordnet and whose 
parameters are estimated from a list of verb-object
pairs extracted from a corpus. "Explaining away",
a well-known property of Bayesian networks, helps
the model deal in a natural fashion with word sense
ambiguity in the training
data. On a word sense disambiguation test our 
model performed better than other state of the 
art systems for unsupervised learning of selec- 
tional preferences. Computational complexity 
problems, ways of improving this approach and 
methods for implementing "explaining away" in 
other graphical frameworks are discussed. 

1 Selectional preference and sense 
ambiguity 

Regularities of a verb with respect to the seman- 
tic class of its arguments (subject, object and 
indirect object) are called selectional prefer- 
ences (SP) (Katz and Fodor, 1964; Chomsky, 
1965; Johnson-Laird, 1983). The verb pilot car- 
ries the information that its object will likely be 
some kind of vehicle; subjects of the verb think 
tend to be human; and subjects of the verb bark 
tend to be dogs. For the sake of simplicity we 
will focus on the verb-object relation although 
the techniques we will describe can be applied 
to other verb-argument pairs. 
by NSF awards 9720368, 9870676 and 9812169.



Figure 1: A portion of Wordnet.



Models of the acquisition of SP are impor- 
tant in their own right and have applications in 
Natural Language Processing (NLP). The selec- 
tional preferences of a verb can be used to infer 
the possible meanings of an unknown argument 
of a known verb; e.g., it might be possible to 
infer that xxxx is a kind of dog from the follow- 
ing sentence: "The xxxx barked all night". In 
parsing a sentence selectional preferences can be 
used to rank competing parses, providing a par- 
tial measure of semantic well-formedness. In- 
vestigating SP might help us to understand the 
structure of the mental lexicon. 

Systems for unsupervised learning of SP usu- 
ally combine statistical and knowledge-based 
approaches. The knowledge-base component 
is typically a database that groups words into 
classes. In the models we will see, the knowl- 
edge base is Wordnet (Miller, 1990). Word- 
net groups nouns into classes of synonyms 
representing concepts, called synsets, e.g., 
{car, auto, automobile, ...}. A noun that belongs
to several synsets is ambiguous. A transitive
and asymmetrical relation, hyponymy,
is defined between synsets. A synset is a hyponym
of another synset if the former has the
latter as a broader concept; for example, BEVERAGE
is a hyponym of LIQUID. Figure 1 depicts
a portion of the hierarchy.

The statistical component consists of 
predicate-argument pairs extracted from a 
corpus in which the semantic class of the words 
is not indicated. A trivial algorithm might 
get a list of words that occurred as objects 
of the verb and output the semantic classes 
the words belong to according to Wordnet. 
For example, if the verb drink occurred with 
water and water ∈ LIQUID, the model would
learn that drink selects for LIQUID. As Resnik 
(1997) and Abney and Light (1999) have found, 
the main problem these systems face is the 
presence of ambiguous words in the training 
data. If the word java also occurred as an 
object of drink, since java ∈ BEVERAGE
and java ∈ ISLAND, this model would learn
that drink selects for both BEVERAGE and 
ISLAND. 
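As an illustration, the trivial algorithm described above can be sketched in a few lines of Python. The class assignments in `TOY_CLASSES` and the function name are hypothetical stand-ins for a Wordnet lookup; the sketch only shows how an ambiguous word such as java contributes to every class it belongs to.

```python
from collections import defaultdict

# Hypothetical toy fragment: each noun maps to the classes it belongs to.
TOY_CLASSES = {
    "water": {"LIQUID"},
    "java": {"BEVERAGE", "ISLAND"},
}

def trivial_preferences(objects, classes=TOY_CLASSES):
    """Count, for each class, how many observed objects belong to it."""
    counts = defaultdict(int)
    for noun in objects:
        for cls in classes.get(noun, ()):
            counts[cls] += 1
    return dict(counts)

# Objects observed with the verb "drink" in the running example.
learned = trivial_preferences(["water", "java"])
print(sorted(learned))
```

ISLAND ends up in the output on the same footing as BEVERAGE, which is exactly the unselective behaviour discussed in the text.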

More complex models have been proposed. 
These models, though, deal with word sense 
ambiguity by applying an unselective strategy 
similar to the one above; i.e., they assume that 
ambiguous words provide equal evidence for all 
their senses. As the concepts a verb selects for,
these models choose those that several of the
observed words have in common (e.g., BEVERAGE
above). This strategy works to the extent that
these overlapping senses are also the concepts 
the verb selects for. 

2 Previous approaches to learning 
selectional preference 

2.1 Resnik's model 

Our system is closely related to those proposed
in (Resnik, 1997) and (Abney and Light, 1999).
The fact that a predicate p selects for a class
c, given a syntactic relation r, can be represented
as a relation, selects(p, r, c); e.g., that
eat selects for FOOD in object position can
be represented as selects(eat, object, FOOD).
In (Resnik, 1997) selectional preference is quantified
by comparing the prior distribution of
a given class c appearing as an argument,
P(c), and the conditional probability of the
same class given a predicate and a syntactic
relation, P(c|p,r), e.g., P(FOOD) and
P(FOOD|eat, object). The relative entropy between
P(c) and P(c|p,r) measures how much
the predicate constrains its arguments:

$$S(p,r) = D\big(P(c \mid p,r) \,\|\, P(c)\big) \tag{1}$$

Figure 2: Simplified Wordnet. The numbers
next to the synsets represent the values of
freq(p, r, c) estimated using (3), the numbers in
parentheses represent the values of freq(p, r, w).
Synsets: FOOD 7/4, COGNITION 1/4, ESSENCE 1/4, FLESH 1/4, FRUIT 1/2, BREAD 1/2, DAIRY 1/2.
Words: idea (0), meat (1), apple (1), bagel (1), cheese (1).



Resnik defines the selectional association of
a predicate for a particular class c to be the portion
of the selectional preference strength due to
that class:

$$A(p,r,c) = \frac{1}{S(p,r)}\, P(c \mid p,r) \log \frac{P(c \mid p,r)}{P(c)} \tag{2}$$
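Equations (1) and (2) can be made concrete with a short Python sketch. The two distributions below are hypothetical numbers chosen only for illustration, the function names are ours, and the natural logarithm is an assumption (the association values do not depend on the base, since it cancels in the ratio).

```python
import math

def preference_strength(posterior, prior):
    """Equation (1): relative entropy D(P(c|p,r) || P(c))."""
    return sum(q * math.log(q / prior[c]) for c, q in posterior.items() if q > 0)

def selectional_association(c, posterior, prior):
    """Equation (2): share of the preference strength due to class c."""
    q = posterior[c]
    if q == 0:
        return 0.0
    return q * math.log(q / prior[c]) / preference_strength(posterior, prior)

# Hypothetical distributions over three classes (not taken from any corpus).
prior = {"FOOD": 0.2, "COGNITION": 0.3, "ARTIFACT": 0.5}
posterior = {"FOOD": 0.8, "COGNITION": 0.05, "ARTIFACT": 0.15}

strength = preference_strength(posterior, prior)
assoc = {c: selectional_association(c, posterior, prior) for c in prior}
print(strength, assoc)
```

Because each association is one term of the sum in (1) divided by the whole sum, the associations of all classes add up to 1; classes that are less probable given the predicate than a priori receive a negative value.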



Here the main problem is the estimation of
P(c|p,r). Resnik suggests as a plausible estimator
P(c|p,r) = freq(p,r,c)/freq(p,r). But
since the model is trained on data that are not
sense-tagged, there is no obvious way to estimate
freq(p,r,c). Resnik suggests considering
each observation of a word as evidence for each
of the classes the word belongs to,



$$\mathrm{freq}(p,r,c) \approx \sum_{w \in c} \frac{\mathrm{count}(p,r,w)}{\mathrm{classes}(w)} \tag{3}$$



where count(p, r, w) is the number of times
the word w occurred as an argument of p in 
relation r, and classes(w) is the number of 
classes w belongs to. For example, suppose 
the system is trained on (eat, object) pairs and 
the verb occurred once each with meat, ap- 
ple, bagel, and cheese, and Wordnet is simpli- 
fied as in Figure 2. An ambiguous word like 
meat provides evidence also for classes that ap- 
pear unrelated to those selected by the verb. 
Resnik's assumption is that only the classes selected
by the verb will be associated with each
of the observed words, and hence will receive
the highest values for P(c|p,r). Using (3) we
find that the highest frequency is in fact associated
with FOOD: freq(eat, object, food) ≈
1/4 + 1/2 + 1/2 + 1/2 = 7/4 and P(FOOD|eat) = 0.44.
However, some evidence is found also for COGNITION:
freq(eat, object, cognition) ≈ 1/4 and
P(COGNITION|eat) = 0.06.

Figure 3: The HMM version of the simple example.
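The worked example for equation (3) can be reproduced with a small Python sketch. The `PARENTS` map is our assumed reading of the simplified hierarchy of Figure 2 (meat under both ESSENCE and FLESH), and classes(w) is taken to count every synset above the word; both are assumptions made for illustration.

```python
from fractions import Fraction

# Assumed reading of the simplified hierarchy of Figure 2 (child -> parents).
PARENTS = {
    "ESSENCE": ["COGNITION"],
    "FLESH": ["FOOD"],
    "FRUIT": ["FOOD"],
    "BREAD": ["FOOD"],
    "DAIRY": ["FOOD"],
    "idea": ["ESSENCE"],
    "meat": ["ESSENCE", "FLESH"],
    "apple": ["FRUIT"],
    "bagel": ["BREAD"],
    "cheese": ["DAIRY"],
}

def classes_of(word):
    """All synsets reachable from a word by following parent links."""
    seen, stack = set(), list(PARENTS.get(word, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS.get(node, []))
    return seen

def class_frequencies(counts):
    """Equation (3): split each word count evenly over the word's classes."""
    freq = {}
    for word, n in counts.items():
        word_classes = classes_of(word)
        for c in word_classes:
            freq[c] = freq.get(c, Fraction(0)) + Fraction(n, len(word_classes))
    return freq

# (eat, object) observations: each noun occurred once.
counts = {"meat": 1, "apple": 1, "bagel": 1, "cheese": 1}
freq = class_frequencies(counts)
total = sum(counts.values())                    # freq(p, r)
p_food = float(freq["FOOD"] / total)            # (7/4) / 4
p_cognition = float(freq["COGNITION"] / total)  # (1/4) / 4
print(freq["FOOD"], p_food, freq["COGNITION"], p_cognition)
```

Under these assumptions FOOD receives 7/4 and COGNITION 1/4 of the four observations, which gives the probabilities of roughly 0.44 and 0.06 quoted in the text.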



2.2 Abney and Light's approach 

Abney and Light (1999) pointed out that the 
distribution of senses of an ambiguous word is 
not uniform. They noticed also that it is not 
clear how the probability P(c|p,r) is to be interpreted
since there is no explicit stochastic generation
model involved.

They proposed a system that associates
a Hidden Markov Model (HMM) with each
predicate-relation pair (p,r). Transitions between
synset states represent the hyponymy relation,
and ε, the empty word, is emitted with
probability 1; transitions to a final state emit a
word w with probability 0 < P(w) < 1. Transition
and emission probabilities are estimated
using the EM algorithm on training data that
consist of the nouns that occurred with the verb.
Abney and Light's model estimates P(c|p,r)
from the model trained for (p,r); the distribution
P(c) can be calculated from a model
trained for all nouns in the corpus.

This model did not perform as well as ex- 
pected. An ambiguous word in the model can 
be generated by more than one state sequence. 
Abney and Light discovered that the EM al- 
gorithm finds parameter values that associate 
some probability mass with all the transitions 
in the multiple paths that lead to an ambiguous
word. In other words, when there are several
state sequences for the same word, EM does
not select one of them over the others.¹ Figure 3
shows the parameters estimated by EM for the 
same example as above. The transition to the 
COGNITION state has been assigned a proba- 
bility of 1/8 because it is part of a possible path 
to meat. The HMM model does not solve the 
problem of the unselective distribution of the 
frequency of occurrence of an ambiguous word 
to all its senses. Abney and Light claimed that 
this is a serious problem, particularly when the 
ambiguous word is a frequent one, and caused 
the model to learn the wrong selectional pref- 
erences. To correct this undesirable outcome 
they introduced some smoothing and balancing 
techniques. However, even with these modifica- 
tions their system's performance was below that 
achieved by Resnik's. 

3 Bayesian networks 

A Bayesian network (Pearl, 1988), or 
Bayesian belief network (BBN), consists of a set 
of variables and a set of directed edges con- 
necting the variables. The variables and the 
edges define a directed acyclic graph (DAG) 
where each variable is represented by a node. 
Each variable is associated with a finite number 
of (mutually exclusive) states. To each variable
A with parents B_1, ..., B_n is attached a conditional
probability table (CPT) P(A | B_1, ..., B_n).
Given a BBN, Bayesian inference can be used
to estimate marginal and posterior probabilities
given the evidence at hand and the information
stored in the CPTs, the prior probabilities,
by means of Bayes' rule, P(H|E) =
P(H)P(E|H)/P(E), where H stands for hypothesis and
E for evidence.

Bayesian networks display an extremely inter- 
esting property called explaining away. Word 
sense ambiguity in the process of learning SP de- 
fines a problem that might be solved by a model 
that implements an explaining away strategy. 

¹ As a matter of fact, for this HMM there are (infinitely)
many parameter values that maximize the likelihood
of the training data; i.e., the parameters are not
identifiable. The intuitively correct solution is one of
them, but so are infinitely many other, intuitively incorrect
ones. Thus it is no surprise that the EM algorithm
cannot find the intuitively correct solution.

Figure 4: A Bayesian network for word ambiguity.

Suppose we are learning the selectional preference
of drink, and the network in Figure 4 is the
knowledge base. The verb occurred with java
and water. This situation can be represented
as a Bayesian network. The variables ISLAND
and BEVERAGE represent concepts in a semantic
hierarchy. The variables java and water
stand for possible instantiations of the concepts.
All the variables are Boolean; i.e., they are associated
with two states, true or false. Suppose
the following CPTs define the priors associated
with each node.² The unconditional probabilities
are P(I = true) = P(B = true) = 0.01 and
P(I = false) = P(B = false) = 0.99, and the
CPTs for the child nodes are





P(X = x | Y_1 = y_1, Y_2 = y_2)    I,B     ¬I,B    I,¬B    ¬I,¬B
j = true                           0.99    0.99    0.99    0.01
j = false                          0.01    0.01    0.01    0.99
w = true                           0.99    0.99    0.01    0.01
w = false                          0.01    0.01    0.99    0.99



These values mean that the occurrence of either
concept is a priori unlikely. If either concept is
true the word java is likely to occur. Similarly,
if BEVERAGE is true the word water is also likely
to occur. As the posterior probabilities
show, if java occurs, the beliefs in both concepts
increase: P(I|j) = P(B|j) = 0.3355. However,
water provides evidence for BEVERAGE only.
Overall there is more evidence for the hypothesis
that the concept being expressed is BEVERAGE
and not ISLAND. Bayesian networks
implement this inference scheme; if we compute
the conditional probabilities given that both
words occurred, we obtain P(B|j,w) = 0.98 and
P(I|j,w) = 0.02. The new evidence caused the
"island" hypothesis to be explained away.

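The explaining away example can be checked by brute-force enumeration over the two concept variables. The following Python sketch uses the priors and CPT values given above; the function names are ours, and the CPT for water is written as depending on BEVERAGE only, which is how the values in the table behave.

```python
from itertools import product

P_CONCEPT = 0.01  # prior P(I = true) = P(B = true)

def p_java(i, b):
    """P(java = true | I, B): likely if either concept is true."""
    return 0.99 if (i or b) else 0.01

def p_water(b):
    """P(water = true | B): likely only if BEVERAGE is true."""
    return 0.99 if b else 0.01

def posterior(target, water_observed):
    """P(target = true | java = true [, water = true]) by enumeration."""
    num = den = 0.0
    for i, b in product([True, False], repeat=2):
        joint = (P_CONCEPT if i else 1 - P_CONCEPT) * (P_CONCEPT if b else 1 - P_CONCEPT)
        joint *= p_java(i, b)
        if water_observed:
            joint *= p_water(b)
        den += joint
        if (i if target == "I" else b):
            num += joint
    return num / den

print(posterior("I", False), posterior("B", False))  # java only
print(posterior("I", True), posterior("B", True))    # java and water
```

With java alone, both concepts receive the same posterior of about one third; once water is also observed, almost all of the belief moves to BEVERAGE, in agreement with the values reported above up to rounding.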
3.1 The relevance of priors 

Explaining away seems to depend on the specification
of the prior probabilities. The priors
define the background knowledge available to
the model, not only about the conditional probabilities
of the events represented by the variables,
but also about the joint distributions of several
events. In the simple network above, we defined
the probability that either concept is selected
(i.e., that the corresponding variable is
true) to be extremely small. Intuitively, there
are many concepts and the probability of observing
any particular one is small. This means
that the joint probability of the two events is
much higher in the case in which only one of
them is true (0.0099) than in the case in which
they are both true (0.0001). Therefore, via the
priors, we introduced a bias according to which
the hypothesis that one concept is selected will
be favored over two co-occurring ones. This is a
general pattern of Bayesian networks; the prior
causes simpler explanations to be preferred over
more complex ones, and thereby produces the
explaining away effect.

² I, B, j and w abbreviate ISLAND, BEVERAGE,
java and water, respectively.

Figure 5: A Bayesian network for the simple
example.

4 A Bayesian network approach to 
learning selectional preference 

4.1 Structure and parameters of the 
model 

The hierarchy of nouns in Wordnet defines a 
DAG. Its mapping into a BBN is straightfor- 
ward. Each word or synset in Wordnet is a 
node in the network. If A is a hyponym of B 
there is an arc in the network from B to A. All 
the variables are Boolean. A synset node is true 
if the verb selects for that class. A word node is 
true if the word can appear as an argument of 
the verb. The priors are defined following two 
intuitive principles. First, it is unlikely that a 
verb a priori selects for any particular synset. 
Second, if a verb does select for a synset, say 
FOOD, then it is likely that it also selects for 
its hyponyms, say FRUIT. The same principles
apply to words: it is likely that a word appears 
as an argument of the verb if the verb selects for 
any of its possible senses. On the other hand, 
if the verb does not select for a synset, it is 
unlikely that the words instantiating the synset 
occur as its arguments. "Likely" and "unlikely" 
are given numerical values that sum up to 1. 
The following table defines the scheme for the
CPTs associated with each node in the network;
p_i(X) denotes the ith parent of the node X.

             P(X = x | p_1(X) ∨ ... ∨ p_n(X) = true)    P(X = x | p_1(X) ∧ ... ∧ p_n(X) = false)
x = true     likely                                      unlikely
x = false    unlikely                                    likely



For the root nodes, the table reduces to the
unconditional probability of the node. Now
we can test the model on the simple example
seen earlier. W+ is the set of words that occurred
with the verb. The nodes corresponding
to the words in W+ are set to true and
the others left unset. For the previous example
W+ = {meat, apple, bagel, cheese}, and
the corresponding nodes are set to true, as depicted
in Figure 5. With likely and unlikely
respectively equal to 0.99 and 0.01, the posterior
probabilities are³ P(F|m,a,b,c) = 0.9899
and P(C|m,a,b,c) = 0.0101. Explaining away
works. The posterior probability of COGNITION
gets as low as its prior, whereas the
probability of FOOD goes up to almost 1. A
Bayesian network approach seems to actually
implement the conservative strategy we thought
to be the correct one for unsupervised learning
of selectional restrictions.
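The CPT scheme can be tried out on the simple example with a brute-force sketch in Python. The `PARENTS` map is our assumed reading of Figure 5, root nodes are given the unlikely value as their prior, and the unobserved word idea is left out on the assumption that an unset childless node does not change the posteriors. None of this is the original implementation.

```python
from itertools import product

LIKELY, UNLIKELY = 0.99, 0.01

# Assumed reading of Figure 5 (node -> parents); "idea" is omitted (unobserved leaf).
PARENTS = {
    "COGNITION": [], "FOOD": [],
    "ESSENCE": ["COGNITION"], "FLESH": ["FOOD"], "FRUIT": ["FOOD"],
    "BREAD": ["FOOD"], "DAIRY": ["FOOD"],
    "meat": ["ESSENCE", "FLESH"], "apple": ["FRUIT"],
    "bagel": ["BREAD"], "cheese": ["DAIRY"],
}
EVIDENCE = {"meat": True, "apple": True, "bagel": True, "cheese": True}

def p_true(node, state):
    """P(node = true | parents): likely if any parent is true, else unlikely."""
    parents = PARENTS[node]
    if not parents:
        return UNLIKELY  # root nodes: a priori unlikely
    return LIKELY if any(state[p] for p in parents) else UNLIKELY

def joint(state):
    """Product of all conditional probabilities for one full assignment."""
    prob = 1.0
    for node in PARENTS:
        p = p_true(node, state)
        prob *= p if state[node] else 1.0 - p
    return prob

def posterior(target):
    """P(target = true | evidence) by summing over all hidden assignments."""
    hidden = [n for n in PARENTS if n not in EVIDENCE]
    num = den = 0.0
    for values in product([True, False], repeat=len(hidden)):
        state = dict(zip(hidden, values), **EVIDENCE)
        p = joint(state)
        den += p
        if state[target]:
            num += p
    return num / den

print(posterior("FOOD"), posterior("COGNITION"))
```

Under these assumptions the sketch gives a COGNITION posterior of about 0.0101, as reported, and a FOOD posterior close to 1. The exact FOOD figure may differ from the one quoted in the text, since it depends on details of the network that are not fully specified here. Writing the CPT as a function of "any parent true" rather than as a full table also avoids storing an exponentially large table per node.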

4.2 Computational issues in building 
BBNs based on Wordnet 

The implementation of a BBN for the whole of 
Wordnet faces computational complexity prob- 
lems typical of graphical models. A densely 
connected BBN presents two kinds of problems. 
The first is the storage of the CPTs. The size 
of a CPT grows exponentially with the number 
of parents of the node.⁴ This problem can be
solved by optimizing the representation of these
tables. In our case most of the entries have the
same values, and a compact representation for
them can be found (much like the one used in
the noisy-OR model (Pearl, 1988)).

A harder problem is performing inference.
The graphical structure of a BBN represents
the dependency relations among the random
variables of the network. The algorithms used
with BBNs usually perform inference by dynamic
programming on the triangulated moral
graph. A lower bound on the number of computations
that are necessary to model the joint
distribution over the variables using such algorithms
is 2^(|n|+1), where n is the size of the maximal
boundary set according to the visitation
schedule.

³ F, C, m, a, b and c respectively stand for FOOD,
COGNITION, meat, apple, bagel and cheese.

⁴ Some words in Wordnet have more than 20 senses.
For example, line in Wordnet is associated with 25
senses. The size of its CPT is therefore 2^26. Storing a table
of float numbers for this node alone requires around
(2^26) × 8 = 537 MBytes of memory.

4.3 Subnetworks and balancing

Because of these problems we could not build a
single BBN for Wordnet. Instead we simplified
the structure of the model by building a smaller
subnetwork for each predicate-argument pair. A
subnetwork consists of the union of the sets of
ancestors of the words in W+. Figure 6 provides
an example of the union of these "ancestral
subgraphs" of Wordnet for the words java
and drink (compare it with Figure 1).

Figure 6: The subnetwork for drink.

This simplification does not affect the computation
of the distributions we are interested
in; that is, the marginals of the synset nodes.
A BBN provides a compact representation for
the joint distribution over the set of variables
in the network. If N = X_1, ..., X_n is a Bayesian
network with variables X_1, ..., X_n, its joint distribution
P(N) is the product of all the conditional
probabilities specified in the network,

$$P(N) = \prod_j P(X_j \mid pa(X_j)) \tag{4}$$

where pa(X) is the set of parents of X. A BBN 
generates a factorization of the joint distribu- 
tion over its variables. Consider a network of 
three nodes A, B, C with arcs from A to B and 
C. Its joint distribution can be characterized as 
P(A,B,C) = P(A)P(B|A)P(C|A). If there is
no evidence for C the joint distribution is 

$$
\begin{aligned}
\sum_C P(A,B,C) &= P(A)\,P(B \mid A) \sum_C P(C \mid A) \\
&= P(A)\,P(B \mid A) \\
&= P(A,B)
\end{aligned}
\tag{5}
$$

The node C gets marginalized out. Marginaliz- 
ing over a childless node is equivalent to remov- 
ing it with its connections from the network. 
Therefore the subnetworks are equivalent to the 
whole network; i.e., they have the same joint 
distribution. 
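The subnetwork construction used in this section, the union of the ancestors of the words in W+, amounts to a simple graph traversal. The Python sketch below uses a hypothetical fragment of the noun hierarchy, loosely modelled on Figure 1; the node names, the arcs and the function names are illustrative assumptions, not the actual Wordnet structure.

```python
# Hypothetical fragment of the noun hierarchy (child -> parents).
PARENTS = {
    "java": ["COFFEE", "ISLAND"],
    "drink": ["BEVERAGE"],
    "tea": ["TEA"],
    "COFFEE": ["BEVERAGE"],
    "TEA": ["BEVERAGE"],
    "BEVERAGE": ["LIQUID", "FOOD"],
    "ISLAND": ["LAND"],
    "LAND": ["PHYSICAL_OBJECT"],
    "LIQUID": ["ENTITY"],
    "FOOD": ["ENTITY"],
    "PHYSICAL_OBJECT": ["ENTITY"],
    "ENTITY": [],
}

def ancestors(node, parents=PARENTS):
    """All nodes reachable from `node` by following parent links."""
    seen, stack = set(), list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def subnetwork(observed_words, parents=PARENTS):
    """Nodes and arcs of the union of the ancestral subgraphs of the observed words."""
    nodes = set(observed_words)
    for word in observed_words:
        nodes |= ancestors(word, parents)
    # Arcs run from the broader concept to the narrower one, as in the full network.
    arcs = {(p, child) for child in nodes for p in parents.get(child, [])}
    return nodes, arcs

nodes, arcs = subnetwork({"java", "drink"})
print(sorted(nodes))
```

Nodes such as TEA, which are not ancestors of any observed word, are dropped. By the marginalization argument above, removing them leaves the marginals of the remaining synset nodes unchanged.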

Our model computes the value of P(c|p,r),
but we did not compute the prior P(c) for all
nouns in the corpus. We assumed this to be
a constant, equal to the unlikely value, for all
classes. In a BBN the values of the marginals
increase with their distance from the root nodes.
To avoid undesired bias (see Table 2) we
defined a balancing formula that adjusted the
conditional probabilities of the CPTs in such a
way that we got all the marginals to have approximately
the same value.⁵

5 Experiments and results⁶

5.1 Learning of selectional preferences 

When trained on predicate-argument pairs extracted
from a large corpus, the San Jose Mercury
Corpus, the model produced good rankings of
classes according to their posterior marginal
probabilities. The corpus contains about 1.3
million verb-object tokens. Table 1 shows the
top and the bottom of the list of synsets for
the verb maneuver. The model learned that
maneuver "selects" for members of the class
VEHICLE and of other plausible classes, hyponyms
of VEHICLE. It also learned that the verb does not
select for direct objects that are members of
classes like CONCEPT or PHILOSOPHY.

Ranking    Synset           P(c|p,r)
1          VEHICLE          0.9995
2          VESSEL           0.9893
3          AIRCRAFT         0.9937
4          AIRPLANE         0.9500
5          SHIP             0.9114
...        ...              ...
255        CONCEPT          0.1002
256        LAW              0.1001
257        PHILOSOPHY       0.1000
258        JURISPRUDENCE    0.1000

Table 1: Results for (maneuver, object).

⁵ More details can be found in an extended version of
the paper: www.cog.brown.edu/~massi/.

⁶ For these experiments we used values for the likely
and unlikely parameters of 0.9 and 0.1, respectively.

5.2 Word sense disambiguation test 

A direct evaluation measure for unsupervised 
learning of SP models does not exist. These 
models are instead evaluated on a word-sense 
disambiguation test (WSD). The idea is that 
systems that learn SP produce word sense dis- 
ambiguation as a side-effect. Java might be in- 
terpreted as the island or the beverage, but in a 
context like "the tourists flew to Java" the for- 
mer seems more correct, because fly could select 
for geographic locations but not for beverages. 
A system trained on a predicate p should be 
able to disambiguate arguments of p if it has 
learned its selectional restrictions. 
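The text does not spell out the decision rule used to turn learned preferences into a sense choice. One natural reading, assumed here purely for illustration, is to pick the sense of the argument whose class has the highest P(c|p,r) for the predicate. The posterior values in this Python sketch are hypothetical and are not taken from the experiments.

```python
def disambiguate(noun_senses, class_posterior):
    """Pick the sense of a noun whose class has the highest P(c|p,r)."""
    return max(noun_senses, key=lambda sense: class_posterior.get(sense, 0.0))

# Hypothetical posteriors learned for the verb "fly" (illustration only).
posterior_fly = {"ISLAND": 0.62, "BEVERAGE": 0.03}

# "Java" is ambiguous between the island and the beverage.
chosen = disambiguate(["BEVERAGE", "ISLAND"], posterior_fly)
print(chosen)
```

Senses whose class is missing from the learned posteriors are treated as having probability zero, which is a design choice of this sketch rather than something stated in the paper.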

We tested our model using the test and 
training data developed by Resnik (see Resnik, 
1997). The same test was used in (Abney 
and Light, 1999). The training data consists 
of predicate-object counts extracted from 4/5 
of the Brown corpus (about 1M words). The 
test set consists of predicate-object pairs from 
the remaining 1/5 of the corpus, which has 
been manually sense-annotated by Wordnet re- 
searchers. The results are shown in Table 2. 
The baseline algorithm chooses at random one 
of the multiple senses of an ambiguous word. 
The "first sense" method always chooses the 
most frequent sense (such a system should be 
trained on sense-tagged data). Our model performed
better than the state of the art models
for unsupervised learning of SP. It seems to define
a better estimator for P(c|p,r).

Method                             Result
Baseline                           [illegible]
Abney and Light (HMM smoothed)     [illegible]
Abney and Light (HMM balanced)     42.3%
Resnik                             44.3%
BBN (without balancing)            45.6%
BBN (with balancing)               51.4%
First Sense                        82.5%

Table 2: Results

It is remarkable that the model achieved this 
result making only a limited use of distributional
information. A noun is in W+ if it occurred
at least once in the training set, but the
system does not know if it occurred once or sev- 
eral times; either it occurred or it didn't. The 
model did not suffer too much from this limi- 
tation during this task. This is probably due 
to the sparseness of the training data for the 
test. For each verb the average number of ob- 
ject types is 3.3, for each of them the average 
number of tokens is 1.3; i.e., most of the words 
in the training data only occurred once. For 
this training set we also tested a version of the 
model that built a word node for each observed 
object token and therefore integrated the distri- 
butional information. On the WSD test it per- 
formed exactly the same as the simpler version. 
When trained on the San Jose Mercury Corpus 
the model performed worse on the WSD test 
(35.8%). This is not too surprising considering 
the differences between the SJM and the Brown 
corpora: the former, a recent newswire corpus; 
the latter, an older, balanced corpus. Another 
important factor is the different relevance of dis- 
tributional information. The training data from 
the SJM Corpus are much richer and noisier 
than the Brown data. Here the frequency in- 
formation is probably crucial; however, in this 
case we could not implement the simple scheme 
above. 

5.3 Conclusion 

Explaining away implements a cognitively at- 
tractive and successful strategy. A straightfor- 
ward improvement would be for the model to 



make full use of the distributional information 
present in the training data; we only partially 
achieved this. Bayesian networks are usually 
confronted with a single presentation of evi- 
dence. Their extension to multiple evidence is 
not trivial. We believe the model can be ex- 
tended in this direction. Possibly there are sev- 
eral ways to do so (multinomial sampling, ded- 
icated implementations, etc.). However, we be- 
lieve that the most relevant finding of this re- 
search might be that "explaining away" is not 
only a property of Bayesian networks but of 
Bayesian inference in general and that it might 
be implementable in other kinds of graphical 
models. We observed that the property seems to 
depend on the specification of the prior probabilities.
We found that the HMM model of (Abney
and Light, 1999) was unidentifiable; that is,
there are several solutions for the parameters of
the model, including the desired one. Our intuition
is that it should be possible to implement
"explaining away" in an HMM with priors, so
that it would prefer only one or a few solutions
over many. This model would also have the advantage
of being computationally simpler.